ABSTRACT

This study empirically determined the optimizing
weight to be applied to the Wrongs Total Score in scoring rubrics of
the general form S = R - kW, where S is the Score, R the Rights Total,
k the weight and W the Wrongs Total, if reliability is to be
maximized. As is well known, the traditional formula score rests on a
theoretical framework which is of dubious validity. Two instruments,
variant approaches to the assessment of mathematical knowledge, were
administered to approximately 1,700 entering college freshmen during
an orientation period. The method consists of an iterative computer
procedure for calculating split-half reliability of the tests as the
weights are systematically varied throughout the region of
maximization as determined by essentially canonical approaches. The
results indicate that in contrast to the negative weight for the a
priori formula score, a sizable positive weight maximizes
reliability. The implications for rate of work as the single most
reliable aspect of test performance seem clear. The validity of much
educational testing rests on assumptions of fairness to those tested,
achieved through optimization of standardized conditions. The study
suggests that factors which alter rate-of-work characteristics of
performance may be most detrimental to candidate success.
(Author/DEP)
An Optimizing Weight for "Wrongs" Scores

In scoring a multiple-choice test, the "formula score" or "correction
for guessing" is the most widely used alternative to the simple count of 
the total number of right answers. The formula is 

$$ \text{F.S.} = R - \frac{W}{k - 1}, $$

where

F.S. = Formula Score
R = Total number right
W = Total number wrong
k = Number of choices per test item

The basic assumption which underlies this formula is that responses
fall into two categories: those based on knowledge sufficient to determine
a correct answer, and those based on knowledge insufficient to provide
any basis for response better than chance responding. The value of R, the
total number right, is a combination of the two categories, but the value
of W, the total number wrong, reflects only responses based on insufficient
information. The size of W is used to "correct" the observed value of R,
to estimate the true value of the number of responses based on knowledge,
for the chance behaviors are assumed to be randomly spread equally across
the k choices per item, so that (k - 1)/k of them will be wrong answers,
summing to the observed W score, and 1/k of them will be right answers,
"buried" in the R score. The ratio of "buried" right answers to observed
wrong answers is thus 1/(k - 1).
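As a minimal illustration of the correction just described (not part of the original study), the formula score can be written as a short Python function. The function and argument names are assumptions chosen for this sketch.

```python
def formula_score(rights: int, wrongs: int, choices: int) -> float:
    """Conventional formula score: R - W / (k - 1).

    rights  -- R, total number right
    wrongs  -- W, total number wrong (omitted items are not counted)
    choices -- k, number of choices per item (must be at least 2)
    """
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    # W / (k - 1) estimates the chance successes "buried" in R.
    return rights - wrongs / (choices - 1)


# Hypothetical examinee: 40 right, 8 wrong, on five-choice items.
score = formula_score(rights=40, wrongs=8, choices=5)
print(score)  # 40 - 8/4 = 38.0
```

Under the chance model, the 8 wrong answers imply 8/4 = 2 lucky right answers, which are removed from the 40 observed rights.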

Thorndike (1971) has discussed this correction, emphasizing its
logical flaws and some of its merits. Ebel (1972) has presented research
evidence on the superior reliability of tests when they are scored with
a formula correction. More recently, Lord (1975) has focussed on
examinee behavior under different sets of instructions: formula scoring
directions and number-right directions. He states an assumption that
under number-right scoring candidates replace "Omit" responses by random
marks on the answer sheet. The impact of this random responding is to
reduce the sampling error of the formula score when contrasted with the
number right score. This point is established by considering not

$$ \text{F.S.} = R - \frac{1}{k - 1} W, \quad \text{but} \quad \text{F.S.}' = R + \frac{O}{k}, $$

where O = the number of items unanswered, and R and k are as before.
It has long been known that since R, W, and O sum to a constant (T, the
total number of items), the two values of the formula scores, F.S. and
F.S.', are perfectly correlated.

But the assumption of random responses is not an attractive one.
Lord is clearly concerned that the assumption be recognized for its
crucial role and that instructions be developed to insure that any
omissions under formula scoring are truly items for which candidates
have only a chance, random, potential for success. But the theory is
not strongly substantiated by our evidence on candidate behavior. Guessing
on tests is in the main not random activity.

If the theoretical underpinnings of the formula score are so unattractive,
why are we constrained to the weight, -1/(k - 1), which it leads to for
W? What other weight might we use, and to what purpose? One purpose
might clearly be the development of a maximum reliability for the score
from a test. In an unpublished study by Fischer and Jackson (1971),
the maximization of reliability was taken as the rationale for determining
the best weight, x, for the wrongs. Taking Dressel's (1940) formula
for the Kuder-Richardson reliability of a formula-scored test, Fischer
and Jackson differentiated the equation with respect to the weights for
the wrong answers when the right answers are weighted unity. That is,
defining a weighted score as

$$ \text{W.S.} = R + xW, $$

where x may take any value, positive or negative, for what value is
the reliability of the W.S., the weighted score, a maximum?
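As a worked check on the notation (an editorial sketch, not part of the original analysis), the conventional formula score is the special case x = -1/(k - 1) of the weighted score. If the omit-based score is taken to have the form R + O/k, with O = T - R - W the number of unanswered items, it is a linear function of the conventional score, which is why the two are perfectly correlated.

```latex
\begin{align*}
\text{F.S.} &= R - \frac{W}{k-1} = R + xW
  \quad\text{with } x = -\frac{1}{k-1}, \\
\text{F.S.}' &= R + \frac{O}{k}
  = R + \frac{T - R - W}{k}
  = \frac{k-1}{k}\left(R - \frac{W}{k-1}\right) + \frac{T}{k}
  = \frac{k-1}{k}\,\text{F.S.} + \frac{T}{k}.
\end{align*}
```

For five-choice items this gives x = -0.25 and F.S.' = 0.8 F.S. + T/5, so any weight x above -0.25 gives the wrongs more credit than the a priori correction does.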

Somewhat to their surprise the authors found that the value of x
was positive; the sum of the rights and a fraction of the wrongs was the
most reliable score. Further, the Rights score alone was more reliable
than the conventional formula score in each of the four separately timed
subtests comprising a form of the College Board Scholastic Aptitude Test
(SAT). These were two verbal and two mathematical sections, with x-values
of +.295 and +.585 for the mathematical material and +.639 and +.720
for the verbal.

Lord, in discussing this result observed that "This does not mean 
that we should give bonuses for wrong answers. It merely means that that 
trait of omitting items is a trait that can be quite reliably measured." 
This trait of omitting items, however, may be the trait of working on 
test material with a consistent speed. Lord states in his discussion
that his theoretical development will work best for unspeeded tests. But
the test studied by Fischer and Jackson was a standard SAT form, moderately
speeded. There is a possible difference between omitting an item and
not reaching it. In the standard ETS item analysis, an item is considered
omitted if there is a response to a later item; it is considered Not
Reached if there are no responses to later items. If the preponderance
of omitting in Fischer and Jackson's paper was due to a failure to complete
the test, to Not Reaching, this would be evidence that the trait
which is reliably measured is rate of work, not tendency to omit due to
conservatism or caution.
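The distinction between omitted and Not Reached items can be made concrete with a small sketch. It follows the rule as stated in the text (a blank is an omit if any later item was answered, otherwise it is Not Reached). The function name and the use of None for a blank are assumptions of this illustration, not the ETS implementation.

```python
from typing import Dict, List, Optional


def classify_blanks(responses: List[Optional[str]]) -> Dict[str, int]:
    """Count answered, omitted, and not-reached items for one examinee.

    responses holds the marked choice for each item in test order,
    with None standing for a blank answer space.
    """
    # Index of the last item that received any response (-1 if none did).
    last_answered = -1
    for index, response in enumerate(responses):
        if response is not None:
            last_answered = index

    answered = sum(r is not None for r in responses)
    # Blanks at or before the last answered item are omits.
    omitted = sum(r is None for r in responses[: last_answered + 1])
    # Everything after the last answered item was never reached.
    not_reached = len(responses) - (last_answered + 1)
    return {"answered": answered, "omitted": omitted, "not_reached": not_reached}


# Hypothetical six-item record: one skipped item, then two never reached.
print(classify_blanks(["A", None, "C", "B", None, None]))
# {'answered': 3, 'omitted': 1, 'not_reached': 2}
```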

Fischer and Jackson used a generalized internal consistency approach,
via Dressel's formula, and determined the maximum reliability by
differentiation with respect to the weight for wrongs. The present study
extends this work by an empirical determination of the correlation between
two half tests on two 50-item mathematics tests. Each half test was
scored R + kW (k here is simply the weight for wrongs, exactly equivalent
to Fischer and Jackson's x) and the correlation between them
computed. This was systematically followed throughout the region
-5.0 < k < 5.0. The result was the two empirical curves presented
in Figure 1 and Figure 2. Each of these curves shows a maximum for a
positive weight somewhat less than unity. Tables 1 and 2 provide the
data upon which the graphs were based.
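The procedure described above can be sketched as follows. This is an illustration of the general idea only: the original computer program is not reproduced in the paper, the step size of 0.1 is an assumption, and the six examinees below are invented solely to make the example runnable.

```python
import math
from typing import List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def split_half_curve(r1, w1, r2, w2, weights) -> List[Tuple[float, float]]:
    """Correlate R1 + k*W1 with R2 + k*W2 for every weight k in weights."""
    curve = []
    for k in weights:
        half1 = [r + k * w for r, w in zip(r1, w1)]
        half2 = [r + k * w for r, w in zip(r2, w2)]
        curve.append((k, pearson(half1, half2)))
    return curve


# Hypothetical rights and wrongs on two half-tests (not data from the study).
r1 = [12, 15, 9, 18, 11, 14]
w1 = [10, 8, 9, 6, 13, 7]
r2 = [13, 14, 10, 17, 12, 13]
w2 = [11, 9, 8, 7, 12, 8]

# Weights from -4.9 to 4.9 in steps of 0.1, inside the region -5.0 < k < 5.0.
weights = [i / 10 for i in range(-49, 50)]
curve = split_half_curve(r1, w1, r2, w2, weights)
best_k, best_r = max(curve, key=lambda point: point[1])
print(f"maximum split-half correlation {best_r:.3f} at k = {best_k:+.1f}")
```

Plotting the pairs in `curve` would give the kind of empirical curve that Figures 1 and 2 present for the actual data.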

This result supports the finding of Fischer and Jackson. The two
curves reflect slightly different treatments, however. The curve in
Figure 1 was based on a 50-item mathematical test which consisted of data
sufficiency items. The curve in Figure 2 is based on a 50-item mathematical
test which consisted of "regular math" problems. The data sufficiency
items have the form of two statements and a question. The respondent is
to indicate whether or not there is sufficient information in the statements
to answer the question. An example would be as follows:

If x is a whole number, is it a two-digit number?

(1) x^2 is a three-digit number.
(2) 10x is a three-digit number.

(A) if statement (1) ALONE is sufficient but statement (2) alone
    is not sufficient to answer the question asked,
(B) if statement (2) ALONE is sufficient but statement (1) alone
    is not sufficient to answer the question asked,
(C) if both statements (1) and (2) TOGETHER are sufficient
    to answer the question asked, but NEITHER statement ALONE
    is sufficient,
(D) if EACH statement is sufficient by itself to answer the
    question asked,
(E) if statements (1) and (2) TOGETHER are NOT sufficient
    to answer the question asked and additional data specific
    to the problem are needed.

This difference in the item format was accompanied by differences in 
test content. The data sufficiency material was parallel in content to the 
College Board SAT, which used about 30% items of this type at that time. 
The regular math test was parallel to the College Board basic-level
achievement test in mathematics. This test has a more advanced content than the
Scholastic Aptitude Test. 

A third difference between the two tests (in addition to format and 
content) concerns the development of the half-tests. The data sufficiency 
test was developed as two separately timed subtests of 25 items each. 
These were the two half-tests correlated in the current study. The
mathematics achievement test was administered with a single time limit and
divided into two half tests consisting of all the odd items and all
the even items.

The role of these different factors on the somewhat different
outcomes for the two tests is difficult to determine. The weight for
the wrongs which maximized the split-half correlation was approximately
+0.90 for the data sufficiency test and +0.70 for the regular math
test. These empirical values contrast with the values of +0.295
and +0.585 observed for SAT mathematics subtests in the Fischer and
Jackson study.

Table 3 presents the means, standard deviations and intercorrelations
for the four half-tests considered in the study. The pattern of
intercorrelation is consistent across the two tests. The interhalf
reliability of the data sufficiency Rights score was .67, versus the
value of .81 for the math achievement. Similarly the wrongs score
for the math achievement test was more reliable, .73 versus .62.
The cross-score correlations, R1 - W2 and R2 - W1, were -.46
and -.45 for the data sufficiency test and -.34 and -.35
for the math achievement, with similarly lower cross-score correlations
for the intratest comparisons (R1 - W1, R2 - W2) for the math
achievement test.
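The summary statistics in Table 3 are enough to approximate the reliability curves without the examinee-level data. The sketch below is an editorial illustration, not the study's procedure: it expands the correlation between R1 + kW1 and R2 + kW2 in terms of the tabled standard deviations and correlations, then searches a grid of weights. The step of 0.05 and all names are assumptions, and because the inputs are rounded to two decimals the results are approximate.

```python
import math

# Rounded standard deviations and correlations transcribed from Table 3.
DATA_SUFFICIENCY = {
    "sd": {"R1": 3.82, "W1": 3.57, "R2": 3.50, "W2": 3.36},
    "r": {("R1", "W1"): -0.75, ("R1", "R2"): 0.67, ("R1", "W2"): -0.46,
          ("W1", "R2"): -0.45, ("W1", "W2"): 0.62, ("R2", "W2"): -0.79},
}
MATH_ACHIEVEMENT = {
    "sd": {"R1": 4.43, "W1": 3.75, "R2": 4.57, "W2": 3.67},
    "r": {("R1", "W1"): -0.48, ("R1", "R2"): 0.81, ("R1", "W2"): -0.34,
          ("W1", "R2"): -0.35, ("W1", "W2"): 0.73, ("R2", "W2"): -0.51},
}


def weighted_half_correlation(k: float, stats: dict) -> float:
    """Correlation between R1 + k*W1 and R2 + k*W2 from summary statistics."""
    sd, r = stats["sd"], stats["r"]

    def cov(a: str, b: str) -> float:
        return r[(a, b)] * sd[a] * sd[b]

    # Covariance of the two weighted half-test scores.
    numerator = (cov("R1", "R2")
                 + k * (cov("R1", "W2") + cov("W1", "R2"))
                 + k * k * cov("W1", "W2"))
    # Variance of each weighted half-test score.
    var1 = sd["R1"] ** 2 + 2 * k * cov("R1", "W1") + (k * sd["W1"]) ** 2
    var2 = sd["R2"] ** 2 + 2 * k * cov("R2", "W2") + (k * sd["W2"]) ** 2
    return numerator / math.sqrt(var1 * var2)


def best_weight(stats: dict, step: float = 0.05, limit: float = 5.0) -> float:
    """Grid search for the weight in (-limit, limit) with the highest correlation."""
    n = round(limit / step)
    grid = [i * step for i in range(-n + 1, n)]
    return max(grid, key=lambda k: weighted_half_correlation(k, stats))


for name, stats in [("Data sufficiency", DATA_SUFFICIENCY),
                    ("Math achievement", MATH_ACHIEVEMENT)]:
    k = best_weight(stats)
    print(f"{name}: k = {k:+.2f}, r = {weighted_half_correlation(k, stats):.3f}")
```

With these rounded inputs the maxima fall at positive weights somewhat below unity, in the neighborhood of the values the text reports for the two tests; exact agreement should not be expected from two-decimal summary statistics.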

While omits were not distinguished from Not Reached in the present 
study, the general trait of omissiveness can be gauged somewhat by 
considering the numbers of items not responded to in each of the four 
half-tests studied. The values can be derived from Table 3 as follows: 
Table 3

Means, Standard Deviations and Intercorrelations
for the Half-Test Scores

Data Sufficiency

            R1       W1       R2       W2
R1         1.00    -0.75     0.67    -0.46
W1        -0.75     1.00    -0.45     0.62
R2         0.67    -0.45     1.00    -0.79
W2        -0.46     0.62    -0.79     1.00
Mean      12.15    11.02    12.26    11.24
S.D.       3.82     3.57     3.50     3.36

Math Achievement

            R1       W1       R2       W2
R1         1.00    -0.48     0.81    -0.34
W1        -0.48     1.00    -0.35     0.73
R2         0.81    -0.35     1.00    -0.51
W2        -0.34     0.73    -0.51     1.00
Mean      12.51     6.70    12.97     6.60
S.D.       4.43     3.75     4.57     3.67
Average Number of Items Not Responded To

Data Sufficiency Half-Test 1     1.82
Data Sufficiency Half-Test 2     1.50
Math Achievement Half-Test 1     5.79
Math Achievement Half-Test 2     5.42

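The derivation is simple arithmetic: each half-test has 25 items, so the average number of items not responded to is 25 minus the mean Rights minus the mean Wrongs. A minimal sketch, with names chosen for this illustration and the means taken from Table 3:

```python
ITEMS_PER_HALF_TEST = 25

# (mean rights, mean wrongs) for each half-test, from Table 3.
HALF_TEST_MEANS = {
    "Data Sufficiency Half-Test 1": (12.15, 11.02),
    "Data Sufficiency Half-Test 2": (12.26, 11.24),
    "Math Achievement Half-Test 1": (12.51, 6.70),
    "Math Achievement Half-Test 2": (12.97, 6.60),
}


def mean_not_responded(mean_rights: float, mean_wrongs: float,
                       n_items: int = ITEMS_PER_HALF_TEST) -> float:
    """Average count of items left blank: n_items - mean R - mean W."""
    return n_items - mean_rights - mean_wrongs


for name, (mean_r, mean_w) in HALF_TEST_MEANS.items():
    print(f"{name}: {mean_not_responded(mean_r, mean_w):.2f}")
```

Computed from the two-decimal means, the four values are 1.83, 1.50, 5.79 and 5.43. Two of them differ from the listed figures by .01, presumably because the published means were themselves rounded.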
Clearly the mathematics achievement test was characterized by a
[greater tendency to] omit. Whether this was due to its greater speededness
or to a true lack of knowledge of the material on the part of the subjects
cannot be determined from this data. Either is plausible, since it is
a characteristic of data sufficiency items that they are processed more
[...]

[... Fischer] and Jackson weights for the mathematics tests, which were
.295 and .585, the higher weight was achieved by the section which had a
sizable set of data sufficiency items (18 of its 35-item total) and a
slightly more generous time allotment, 77 seconds per item versus 72
seconds per item. This suggests that the weight approaches unity as the
test is unspeeded. However, the general parity of the number correct on
the various half-tests, versus the differences in number wrong, suggests
that there may be a greater tendency to give a response to the data
sufficiency items, to guess at an answer, than to respond to the
mathematics achievement items. This implies a more complex cause for the
differences in weight than simply rate of work.

It is interesting to contrast the curves in Figures 1 and 2 with one
provided by Fischer and Jackson, presented as Figure 3. In the present
study, using empirically determined curves, there is no suggestion of the
minimum point for reliability which is clear in the Fischer and Jackson
[...] is not clear since no theoretical analysis of the intercorrelations of
the tests in this study was undertaken.

Findings of a maximized reliability through a positive weight would 
seem to indicate that the most reliable aspect of a test performance is 
the total number of marks which are made. This hardly seems a worthwhile 
characteristic to focus on, since it would have little implication for 
validity. However, it is possible that further study of omissiveness 
would lead to an understanding of the reliability of the two forms of 
omissiveness: Omits and Not Reached. The best current data on this 
reliability is available from a study by Flaugher, Melton and Myers (1966), 
which shows the correlation between a mathematical section of the 
Scholastic Aptitude Test and each of four other, parallel sections 
introduced experimentally. The results are summarized in Table 4. 

Table 4

Parallel Form Reliability for Four Scores:
Rights, Wrongs, Omits and Not Reached*

Correlations with Master Form

Parallel    Rights-    Wrongs-    Omits-    Not Reached-
Forms       Rights     Wrongs     Omits     Not Reached
   1         .790       .700       .628        .452
   2         .785       .713       .536        .485
   3         .776       .720       .648        .464
   4         .770       .710       .576        .446

*From Flaugher, Melton and Myers (1966)

This data suggests that the Not Reached score is not as reliable 
as the Omits score. While this cannot be generalized too broadly, it 
bears on the meaning of the positive weight for the maximally reliable
composite score. To the extent that the number of omits on parallel
forms reflects a reliable tendency not to know a certain proportion of
the answers, it is surprising that this would be a more reliable
characteristic of an individual than rate-of-work would be. Even with major
efforts at content and difficulty parallelism, most parallel forms vary 
a good deal, so that one would not readily predict that individuals 
would find comparable numbers of items they would decide not to attempt. 
Further research seems indicated to clarify the degree to which the Omit 
response is determined by rate of work. 

This paper has confirmed the determination by Fischer and Jackson 
of a positive weight for the wrongs as a reliability maximizing score. 
The parallel-forms technique in the present study varied somewhat from 
the internal-consistency approach which they used. The implications of 
this weight, as Lord suggests, are that the trait of omissiveness is a 
reliable one. The source of this reliability and the implications for 
work on test speededness could be meaningful future areas for research. 